Predict Bike Sharing Demand with AutoGluon Template

Project: Predict Bike Sharing Demand with AutoGluon

Install packages

Setup Kaggle API Key

Download and explore dataset

Import Dependencies to start the project

Step 3: Train a model using AutoGluon’s Tabular Prediction

Requirements:

Review AutoGluon's training run, ranking the models that performed best.
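The training-and-review step could look roughly like the sketch below. The file name `train.csv`, the eval metric, the preset, and the time limit are all assumptions, not the notebook's exact settings:

```python
import pandas as pd
from autogluon.tabular import TabularPredictor

# Hypothetical: "train.csv" is the Kaggle training file with a "count" target
train = pd.read_csv("train.csv", parse_dates=["datetime"])

# Fit AutoGluon on the tabular data; presets and time_limit are assumptions
predictor = TabularPredictor(label="count", eval_metric="root_mean_squared_error").fit(
    train_data=train,
    time_limit=600,
    presets="best_quality",
)

# The leaderboard ranks every trained model by validation score
predictor.leaderboard()
```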

Create predictions from test dataset

NOTE: Kaggle will reject the submission if we don't set everything to be > 0.
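A minimal way to enforce this with pandas, on hypothetical model outputs:

```python
import pandas as pd

predictions = pd.Series([3.2, -0.7, 12.5, -2.0])  # hypothetical model outputs

# Replace any negative prediction with 0 before building the submission
predictions = predictions.clip(lower=0)
print(predictions.tolist())  # → [3.2, 0.0, 12.5, 0.0]
```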

Set predictions to submission dataframe, save, and submit

View submission via the command line or in the web browser under the competition's page - My Submissions

Initial Kaggle score of 1.7632

Step 4: Exploratory Data Analysis and Feature Engineering

Let us start by taking a look at the feature names in the dataset

Histogram of all features

Evidently there are some features which are correlated with each other.

Checking out the holiday and workingday features

Since holiday is a categorical feature that is not ordinal in nature, we can one-hot encode it and remove the original column.

Since workingday is likewise a categorical, non-ordinal feature, we can one-hot encode it and remove the original column as well.
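Both encodings can be done in one call with `pd.get_dummies`, sketched here on toy rows:

```python
import pandas as pd

df = pd.DataFrame({"holiday": [0, 1, 0], "workingday": [1, 0, 1]})  # toy rows

# One-hot encode both flags; get_dummies drops the original columns automatically
df = pd.get_dummies(df, columns=["holiday", "workingday"])
print(df.columns.tolist())
```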

Checking out the weather feature

It is slightly alarming to see that there is only 1 sample for weather = 4

From kaggle it is found that the descriptions for weather are:

It is evident that categories 3 and 4 can be clustered together into one: Bad Weather. Category 1 refers to good, clear weather, and category 2 refers to cloudy weather. Moreover, weather category 4 has only 1 sample in the training data, so on its own it does not add much value to our analysis.

So in total we can have 3 Categories:
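A simple mapping implements this grouping; the bucket names here are illustrative:

```python
import pandas as pd

weather = pd.Series([1, 2, 3, 4, 2])  # raw weather codes

# Merge codes 3 and 4 into one "bad" bucket, per the clustering above
weather_grouped = weather.map({1: "good", 2: "cloudy", 3: "bad", 4: "bad"})
print(weather_grouped.tolist())
```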

Checking the season feature

From Kaggle it can be seen that the season feature is categorized as follows:

This means season is also a categorical feature that needs to be encoded, as its values are not ordinal.

First let us check the distribution of season and count values.

It is evident that as the season changes from 1 to 4, the count values also change, with an increase during summer and fall and a decrease during winter and spring. So there is a direct relationship between the season and the bike demand counts.

Let us one-hot encode these features, add them to our dataset and then remove the original feature

Checking out the atemp and temp columns

First let us check the distribution of atemp and count values.

There seems to be a roughly positive correlation between temperature and counts, though the relationship isn't clear-cut.

It is evident here that the atemp values are discrete in nature even though it is supposed to be continuous. We can thus discretize these values into buckets by creating one-hot vectors.

We can identify the buckets for categorizing the feature using a decision tree model in order to perform feature binning.

The idea is to find the best set of buckets or bins using a decision tree model that will involve correlation with the target variable.
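A minimal sketch of this idea, on synthetic data (the real notebook fits on the actual atemp/count columns): a shallow regression tree's split thresholds become the bin edges.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
atemp = rng.uniform(0, 45, 500).reshape(-1, 1)        # synthetic temperatures
count = atemp.ravel() * 3 + rng.normal(0, 5, 500)     # synthetic target

# A tree capped at 6 leaves learns at most 5 thresholds, i.e. 6 bins
tree = DecisionTreeRegressor(max_leaf_nodes=6, random_state=0).fit(atemp, count)
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)  # -2 marks leaves
print(thresholds)
```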

Based on the above, since the tree has 6 leaf nodes, we can bin the temperatures into 6 categories, whose boundaries come from the tree's best variance-reducing splits:

We can now discard the original atemp and temp columns, however let us check the correlation between the two.
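The correlation check is a one-liner with pandas; the values below are toy stand-ins for the real temp/atemp columns:

```python
import pandas as pd

df = pd.DataFrame({
    "temp":  [9.8, 13.9, 17.2, 26.2, 30.3],
    "atemp": [11.4, 15.9, 19.7, 31.1, 33.3],  # toy "feels like" values
})

# Pearson correlation between the two temperature columns
corr = df["temp"].corr(df["atemp"])
print(round(corr, 3))
```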

As is evident, these 2 are highly correlated. But the feature to keep would be the one with the higher correlation with the target.

Since atemp and temp are both similarly correlated with the target, we can remove them both and have the one-hot encoded features.

Checking out the datetime feature

The datetime feature contains the datetime stamp in the format: YYYY-MM-DD HH:MM:SS

There are a few features that can be generated from this field, which we will corroborate with the following hypotheses:

First we start with the year feature we just generated. Let us check for annual trends in bike demand.

It is evident that there was a significant rise in bike demand in 2012. This may be due to external factors which we do not have access to; however, since our test data also contains datapoints from 2011, it makes sense that the year will play a part in predicting bike demand.

We need to convert the year into a categorical feature as well.

Currently we have integer values in year, which needs to be mapped into categorical features. We can one-hot encode these into new features, and drop the original year column.

Now let us look at the seasonal/monthly trends of count

As is evident, there is an obvious monthly/seasonal trend in bike demand as well, with more bikes being rented out from June - Sept.

We can model this variation by having categorical features denote the month as well.

Now let us look at the daily trends of count

This isn't really showing us much, as there aren't any obvious patterns or trends here. A better comparison would be to check whether there are trends between count and weekday/weekend, and since the 1st of every month (or 2nd, 3rd, etc.) isn't always a weekday (or weekend), we need to figure out a way to extract that information first.

In order to do that we can use the `datetime` module
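With the column parsed as datetimes, pandas exposes this directly via the `.dt` accessor; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-03 09:00:00"]})
df["datetime"] = pd.to_datetime(df["datetime"])

# dayofweek: Monday=0 ... Sunday=6; weekend when the value is 5 or 6
df["dayofweek"] = df["datetime"].dt.dayofweek
df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)
print(df[["dayofweek", "is_weekend"]].values.tolist())
```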

Let us check the trend based on weekday/weekend with count

This seems to show more of a consistent trend, with:

This is consistent with the idea that more people rent bikes on weekdays than on weekends.

We can thus bin these into 3 buckets:
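One hypothetical way to implement such a bucketing (the notebook's exact three buckets may differ; weekday / Saturday / Sunday is an assumption here):

```python
import pandas as pd

dayofweek = pd.Series([0, 3, 5, 6])  # Mon, Thu, Sat, Sun

# Hypothetical 3-way bucketing of the day-of-week values
day_type = dayofweek.map(
    lambda d: "weekday" if d < 5 else ("saturday" if d == 5 else "sunday")
)
print(day_type.tolist())
```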

Now based on the time_of_day feature we can engineer other new features, which will be categorical so we will have to one-hot encode them as well.

We start by figuring out the logic for the bins.

We find out the bins for the time of day similar to how we did for the atemp feature. We use a decision tree regressor to figure out the nodes of best split.

Let us corroborate this node split with the trends seen hourly as well

According to both the plots we can safely assume that the bins for the hourly trends look like:

As is evident, the rush-hour times see the biggest spike in demand, so we can use this as a feature.

Checking out the humidity and windspeed features.

From the look of it, it might be better to leave humidity and windspeed as continuous variables rather than convert them into categorical ones. However, since they are on a different scale from the other one-hot encoded features, it is imperative that they be rescaled so that they don't get extra weight while the models train.

Let us look at the distributions of humidity and windspeed

Looks like humidity is almost normally distributed, so we can simply rescale those values using min-max scaling.
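Min-max scaling maps the column onto [0, 1], matching the range of the one-hot columns; a sketch on toy values:

```python
import pandas as pd

humidity = pd.Series([30.0, 55.0, 80.0, 100.0])  # toy humidity values

# Min-max scale to [0, 1] so the feature matches the 0/1 one-hot columns
humidity_scaled = (humidity - humidity.min()) / (humidity.max() - humidity.min())
print(humidity_scaled.tolist())
```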

However, windspeed seems quite skewed. We can normalize windspeed, but there are quite a few 0s which look like outliers. We can treat them as missing/error values and, based on other features that affect windspeed, estimate windspeed for those rows.

We can then check the distribution to see if that has improved the distribution to be more normal.

We can make a preprocessing pipeline for the other features so that the train and test data are preprocessed in one step, which can then be used both for the count predictor and for the windspeed predictor.

Make category types for these so models know they are not just numbers
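In pandas this is an `astype("category")` cast; a minimal sketch on assumed column names:

```python
import pandas as pd

df = pd.DataFrame({"season": [1, 2, 3, 4], "weather": [1, 1, 2, 3]})

# Mark integer-coded columns as categorical so models don't treat the codes as numbers
for col in ["season", "weather"]:
    df[col] = df[col].astype("category")
print(df.dtypes.astype(str).tolist())
```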

Define a pre-processing pipeline function

Same for test data

Run an estimation model for the windspeed feature

We first combine the train and test data for this

Necessary features for estimating windspeed are:

As we have seen above, the windspeed feature is very skewed. We need to transform it so it approximates a gaussian distribution. This will allow us to estimate it better.

For this purpose we use the Box-Cox transformation from the `scipy` module
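`scipy.stats.boxcox` fits the transform's lambda from the data and requires strictly positive inputs, which is why the 0 windspeeds are excluded first; a sketch on toy values:

```python
import numpy as np
from scipy import stats

windspeed = np.array([6.0, 8.9, 12.9, 19.0, 23.0, 31.0])  # strictly positive values

# Returns the transformed data and the fitted lambda parameter
transformed, lam = stats.boxcox(windspeed)
print(lam)
```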

Using XGBoost to model the windspeed

Using this trained model to estimate the Box-Cox transformed windspeeds for the rows where windspeed is 0
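The overall fill-in pattern looks like the sketch below. The notebook uses XGBoost; here scikit-learn's gradient boosting stands in, the data is synthetic, and the feature set is an assumption:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "humidity": rng.uniform(20, 100, 300),
    "temp": rng.uniform(0, 40, 300),
})
# Synthetic windspeed, with some 0s standing in for missing readings
df["windspeed"] = 5.0 + 0.3 * df["temp"] + rng.normal(0, 1, 300)
df.loc[df.index[:30], "windspeed"] = 0.0

features = ["humidity", "temp"]
known = df[df["windspeed"] > 0]      # rows with a real reading
missing = df[df["windspeed"] == 0]   # rows to impute

# Fit on the known rows, then predict windspeed for the 0 rows
model = GradientBoostingRegressor(random_state=0).fit(known[features], known["windspeed"])
df.loc[missing.index, "windspeed"] = model.predict(missing[features])
```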

Now we can normalize the windspeed values using min-max transformation

Last step remaining is to transform the count variable since it is also a skewed variable.

However we have to keep in mind that the actual predictions to be submitted need to be transformed back.

Count seems to be approximately normally distributed with these values, so let us run the model to predict the transformed counts, and then apply an inverse Box-Cox transform on the predicted values to get the actual counts back.

Step 5: Rerun the model with the same settings as before, just with more features

Run the model again on the new datasets

Let us do the inverse box cox to get the actual count values
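The round trip uses `scipy.special.inv_boxcox` with the same lambda that the forward transform fitted; a sketch on toy counts:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

counts = np.array([1.0, 5.0, 40.0, 180.0, 600.0])  # toy positive counts

transformed, lam = stats.boxcox(counts)
# inv_boxcox undoes the transform given the same fitted lambda
recovered = inv_boxcox(transformed, lam)
print(np.round(recovered, 3))
```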

New Score of 0.529

Step 6: Hyperparameter optimization

New Score of 0.52538

A Different Approach: ln(count+1)

Since the count variable is highly skewed, and the evaluation metric used on Kaggle is RMSLE (Root Mean Squared Log Error), it makes sense to train our models on the natural log of the count variable, ln(count + 1). The models then compute RMSE on the logs of the actual and predicted values, effectively optimizing the RMSLE metric directly.

First let us get rid of the Box-Cox transformed count values and replace them with the ln(count + 1) values. The +1 is there because some count values are 0, and ln requires strictly positive input.

This is fine, however, as it won't affect our metric too much: RMSLE penalizes underestimated values heavily and overestimated values more lightly.

Get the actual predictions back by taking the exponent and subtracting 1
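This log/exp round trip maps directly onto numpy's `log1p`/`expm1` pair, which handle the +1/-1 safely:

```python
import numpy as np

counts = np.array([0.0, 3.0, 15.0, 250.0])  # toy counts, including a 0

log_counts = np.log1p(counts)     # ln(count + 1), safe for zeros
recovered = np.expm1(log_counts)  # exp(x) - 1 inverts the transform
print(recovered)
```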

That seems to have worked quite well; the new score is 0.517.

Step 7: Write a Report

Refer to the markdown file for the full report

Creating plots and table for report